
    Comparing Bayes Model Averaging and Stacking When Model Approximation Error Cannot be Ignored

    We compare Bayes Model Averaging, BMA, to a non-Bayes form of model averaging called stacking. In stacking, the weights are no longer posterior probabilities of models; they are obtained by a technique based on cross-validation. When the correct data generating model (DGM) is on the list of models under consideration, BMA is never worse than stacking and often is demonstrably better, provided that the noise level is of order commensurate with the coefficients and explanatory variables. Here, however, we focus on the case that the correct DGM is not on the model list and may not be well approximated by the elements on the model list. We give a sequence of computed examples by choosing model lists and DGMs to contrast the risk performance of stacking and BMA. In the first examples, the model lists are chosen to reflect geometric principles that should give good performance. In these cases, stacking typically outperforms BMA, sometimes by a wide margin. In the second set of examples, we examine how stacking and BMA perform when the model list includes all subsets of a set of potential predictors. When we standardize the size of terms and coefficients in this setting, we find that BMA outperforms stacking when the deviant terms in the DGM ‘point’ in directions accommodated by the model list, but that when the deviant term points outside the model list, stacking seems to do better. Overall, our results suggest that stacking has better robustness properties than BMA in the most important settings.
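    The stacking half of this comparison can be sketched in a few lines. The sketch below is illustrative only (the synthetic DGM, the two-model list, and the crude clip-and-renormalize step are our assumptions, not the paper's setup): each model's leave-one-out predictions are collected, and the weights minimize the cross-validated squared error.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: the true DGM is y = x + 0.5*x^2 + noise, deliberately
    # NOT on the model list below (all choices here are illustrative).
    n = 200
    x = rng.uniform(-2, 2, n)
    y = x + 0.5 * x**2 + rng.normal(0, 0.3, n)

    # Model list: two simple regressions, neither matching the DGM exactly.
    designs = [np.column_stack([np.ones(n), x]),          # linear model
               np.column_stack([np.ones(n), np.sin(x)])]  # sine model

    # Leave-one-out cross-validated predictions for each model.
    cv_preds = np.zeros((n, len(designs)))
    for j, X in enumerate(designs):
        for i in range(n):
            keep = np.arange(n) != i
            beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
            cv_preds[i, j] = X[i] @ beta

    # Stacking weights: minimize ||y - cv_preds @ w||^2 over the simplex.
    # Clipping and renormalizing is a crude stand-in for the constrained
    # quadratic program used in practice.
    w, *_ = np.linalg.lstsq(cv_preds, y, rcond=None)
    w = np.clip(w, 0, None)
    w = w / w.sum()
    print("stacking weights:", w)
    ```

    BMA would instead weight the same two models by their posterior probabilities; the risk comparison in the abstract is between predictors built from these two weightings.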

    Reference priors for exponential families with increasing dimension

    In this article, we establish the asymptotic normality of the posterior distribution for the natural parameter in an exponential family based on independent and identically distributed data. The mode of convergence is expected Kullback-Leibler distance and the number of parameters p is increasing with the sample size n. Using this, we give an asymptotic expansion of the Shannon mutual information valid when p = p_n increases at a sufficiently slow rate. The second term in the asymptotic expansion is the largest term that depends on the prior and can be optimized to give Jeffreys’ prior as the reference prior in the absence of nuisance parameters. In the presence of nuisance parameters, we find an analogous result for each fixed value of the nuisance parameter. In three examples, we determine the rates at which p_n can be allowed to increase while still retaining asymptotic normality and the reference prior property.
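    For orientation, the classical fixed-dimension version of this expansion (the article's contribution is letting p = p_n grow with n) can be written as follows; the notation here is ours:

    ```latex
    I(\Theta; X^n) \;=\; \frac{p}{2}\log\frac{n}{2\pi e}
      \;+\; \int \pi(\theta)\,\log\frac{|I(\theta)|^{1/2}}{\pi(\theta)}\,d\theta \;+\; o(1),
    ```

    where I(θ) is the Fisher information matrix. The integral is the prior-dependent second term; it is maximized by taking π(θ) ∝ |I(θ)|^{1/2}, i.e., Jeffreys' prior, which is why Jeffreys' prior emerges as the reference prior.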

    MANIFEST CHARACTERIZATION AND TESTING FOR CERTAIN LATENT PROPERTIES

    Work due to Junker and more recently due to Junker and Ellis characterized desired latent properties of an educational testing procedure in terms of a collection of other manifest properties. This is important because one can only propose tests for manifest quantities, not latent ones. Here, we complete the conversion of a pair of latent properties to equivalent conditions in terms of four manifest quantities and identify a general method for producing tests for manifest properties.

    CLOSED FORM EXPRESSIONS FOR BAYESIAN SAMPLE SIZE

    Sample size criteria are often expressed in terms of the concentration of the posterior density, as controlled by some sort of error bound. Since this is done pre-experimentally, one can regard the posterior density as a function of the data. Thus, when a sample size criterion is formalized in terms of a functional of the posterior, its value is a random variable. Generally, such functionals have means under the true distribution. We give asymptotic expressions for the expected value, under a fixed parameter, for certain types of functionals of the posterior density in a Bayesian analysis. The generality of our treatment permits us to choose functionals that encapsulate a variety of inference criteria and large ranges of error bounds. Consequently, we get simple inequalities which can be solved to give minimal sample sizes needed for various estimation goals. In several parametric examples, we verify that our asymptotic bounds give good approximations to the expected values of the functionals they approximate. Also, our numerical computations suggest our treatment gives reasonable results.
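    A toy special case shows the flavor of solving such an inequality for n. This is not the paper's general functional, just the textbook normal-mean case under a flat prior, where the bound has a closed form:

    ```python
    import math
    from statistics import NormalDist

    # Illustrative special case: normal data with known sigma and a flat prior
    # give posterior theta | data ~ N(xbar, sigma^2 / n). Requiring the posterior
    # to concentrate, P(|theta - xbar| < eps | data) >= 1 - alpha, yields
    # n >= (z_{1 - alpha/2} * sigma / eps)^2, which we solve for the minimal n.

    def minimal_n(sigma: float, eps: float, alpha: float) -> int:
        z = NormalDist().inv_cdf(1 - alpha / 2)   # standard normal quantile
        return math.ceil((z * sigma / eps) ** 2)

    print(minimal_n(sigma=1.0, eps=0.1, alpha=0.05))  # -> 385
    ```

    The paper's contribution is the analogue of this computation for much more general posterior functionals, where no closed form is available and asymptotic expansions supply the inequality instead.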

    EnsCat: clustering of categorical data via ensembling

    Background: Clustering is a widely used collection of unsupervised learning techniques for identifying natural classes within a data set. It is often used in bioinformatics to infer population substructure. Genomic data are often categorical and high dimensional, e.g., long sequences of nucleotides. This makes inference challenging: The distance metric is often not well-defined on categorical data; running time for computations using high dimensional data can be considerable; and the Curse of Dimensionality often impedes the interpretation of the results. Up to the present, however, the literature and software addressing clustering for categorical data have not yet led to a standard approach. Results: We present software for an ensemble method that performs well in comparison with other methods regardless of the dimensionality of the data. In an ensemble method a variety of instantiations of a statistical object are found and then combined into a consensus value. It has been known for decades that ensembling generally outperforms the components that comprise it in many settings. Here, we apply this ensembling principle to clustering. We begin by generating many hierarchical clusterings with different clustering sizes. When the dimension of the data is high, we also randomly select subspaces, themselves of variable size, to generate clusterings. Then, we combine these clusterings into a single membership matrix and use this to obtain a new, ensembled dissimilarity matrix using Hamming distance. Conclusions: Ensemble clustering, as implemented in R and called EnsCat, gives more clearly separated clusters than other clustering techniques for categorical data. The latest version with manual and examples is available at https://github.com/jlp2duke/EnsCat
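    The three-step recipe in the Results paragraph can be sketched as follows. This is a minimal illustration of the ensembling idea, not the EnsCat implementation (which is in R); the toy data, the cluster-count range, and the subspace sizes are all our assumptions:

    ```python
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(1)

    # Toy categorical data: 20 "sequences" over an alphabet of size 4,
    # with two planted groups.
    X = np.vstack([rng.choice(4, size=(10, 30), p=[.7, .1, .1, .1]),
                   rng.choice(4, size=(10, 30), p=[.1, .1, .1, .7])])
    n, d = X.shape

    # Step 1: many base hierarchical clusterings, varying both the number
    # of clusters and a randomly chosen subspace of variable size.
    memberships = []
    for k in range(2, 6):
        cols = rng.choice(d, size=rng.integers(10, d + 1), replace=False)
        dist = pdist(X[:, cols], metric="hamming")   # mismatch fraction
        labels = fcluster(linkage(dist, method="average"),
                          k, criterion="maxclust")
        memberships.append(labels)
    M = np.array(memberships).T          # membership matrix: n x (#clusterings)

    # Step 2: ensembled dissimilarity between items = Hamming distance
    # between their rows of cluster labels.
    D_ens = pdist(M, metric="hamming")

    # Step 3: one final hierarchical clustering on the consensus dissimilarity.
    final = fcluster(linkage(D_ens, method="average"), 2, criterion="maxclust")
    print(final)
    ```

    The key move is Step 2: items that repeatedly land in the same base cluster get a small ensembled dissimilarity, regardless of how noisy any single base clustering is.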

    A Bayes Interpretation of Stacking for M-Complete and M-Open Settings

    In M-open problems where no true model can be conceptualized, it is common to back off from modeling and merely seek good prediction. Even in M-complete problems, taking a predictive approach can be very useful. Stacking is a model averaging procedure that gives a composite predictor by combining individual predictors from a list of models using weights that optimize a cross-validation criterion. We show that the stacking weights also asymptotically minimize a posterior expected loss. Hence we formally provide a Bayesian justification for cross-validation. Often the weights are constrained to be positive and sum to one. For greater generality, we omit the positivity constraint and relax the ‘sum to one’ constraint.
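    In symbols (the notation here is ours), with K models and \(\hat f_k^{(-i)}\) denoting the k-th model's predictor fit without observation i, the unconstrained stacking weights described above solve the cross-validation criterion

    ```latex
    \hat{w} \;=\; \operatorname*{arg\,min}_{w \in \mathbb{R}^K}
      \sum_{i=1}^{n} \Bigl( y_i - \sum_{k=1}^{K} w_k\, \hat{f}_k^{(-i)}(x_i) \Bigr)^{\!2},
    ```

    and the paper's result is that this \(\hat{w}\) asymptotically minimizes a posterior expected loss, supplying the Bayesian justification.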

    A Bayesian test for excess zeros in a zero-inflated power series distribution

    Power series distributions form a useful subclass of one-parameter discrete exponential families suitable for modeling count data. A zero-inflated power series distribution is a mixture of a power series distribution and a degenerate distribution at zero, with a mixing probability p for the degenerate distribution. This distribution is useful for modeling count data that may have extra zeros. One question is whether the mixture model can be reduced to the power series portion, corresponding to p = 0, or whether there are so many zeros in the data that zero inflation relative to the pure power series distribution must be included in the model, i.e., p > 0. The problem is difficult partially because p = 0 is a boundary point. Here, we present a Bayesian test for this problem based on recognizing that the parameter space can be expanded to allow p to be negative. Negative values of p are inconsistent with the interpretation of p as a mixing probability; however, they index distributions that are physically and probabilistically meaningful. We compare our Bayesian solution to two standard frequentist testing procedures and find that using a posterior probability as a test statistic has slightly higher power on the most important ranges of the sample size n and parameter values than the score test and likelihood ratio test in simulations. Our method also performs well on three real data sets. Comment: Published at http://dx.doi.org/10.1214/193940307000000068 in the IMS Collections (http://www.imstat.org/publications/imscollections.htm) by the Institute of Mathematical Statistics (http://www.imstat.org).
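    The zero-inflated Poisson, the most familiar zero-inflated power series distribution, makes the boundary-expansion idea concrete. The parameter values below are illustrative: a slightly negative p is no longer a mixing probability, yet it still yields a proper probability mass function as long as the mass at zero stays nonnegative, which is what lets the test treat p = 0 as an interior point.

    ```python
    import math

    def zip_pmf(y: int, lam: float, p: float) -> float:
        """Zero-inflated Poisson mass: p at zero plus (1 - p) times Poisson."""
        pois = math.exp(-lam) * lam**y / math.factorial(y)
        return p * (y == 0) + (1 - p) * pois

    # A slightly negative p still gives a valid distribution here:
    # pmf(0) = -0.05 + 1.05 * exp(-2) > 0, and the masses sum to one.
    probs = [zip_pmf(y, lam=2.0, p=-0.05) for y in range(50)]
    print(zip_pmf(0, lam=2.0, p=-0.05), sum(probs))
    ```

    Negative p corresponds to a zero-deflated distribution, which is "physically and probabilistically meaningful" in exactly the sense the abstract describes.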

    Modeling Association in Microbial Communities with Clique Loglinear Models

    There is a growing awareness of the important roles that microbial communities play in complex biological processes. Modern investigations of these communities often use next generation sequencing of metagenomic samples to determine community composition. We propose a statistical technique based on clique loglinear models and Bayes model averaging to identify microbial components in a metagenomic sample, at various taxonomic levels, that have significant associations. We describe the model class, a stochastic search technique for model selection, and the calculation of estimates of posterior probabilities of interest. We demonstrate our approach using data from the Human Microbiome Project and from a study of the skin microbiome in chronic wound healing. Our technique also identifies significant dependencies among microbial components as evidence of possible microbial syntrophy.
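    The simplest building block of such loglinear association analysis is the pairwise log odds ratio for presence/absence of two taxa across samples. The sketch below is only that building block, not the paper's clique-loglinear search or its Bayes model averaging; the counts are made up for illustration:

    ```python
    import math

    # Hypothetical 2x2 presence/absence table for two taxa across samples:
    #                  taxon B present   taxon B absent
    # taxon A present        30                10
    # taxon A absent          5                55
    a, b, c, d = 30, 10, 5, 55

    log_or = math.log((a * d) / (b * c))      # > 0 means positive association
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)     # Wald standard error
    ci = (log_or - 1.96 * se, log_or + 1.96 * se)
    print(log_or, ci)                          # CI excluding 0 => association
    ```

    The clique loglinear models in the paper generalize this from pairs to cliques of taxa that are jointly dependent, with model averaging quantifying the uncertainty in which cliques to include.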
